GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50187
GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50187lriggs wants to merge 2 commits into
Conversation
|
|
Gandiva's REPLACE hardcoded a 65535-byte output cap, throwing "Buffer overflow for output string" whenever the result exceeded 64 KB. Size the output buffer to the exact result instead, by counting non-overlapping matches of from_str: text_len + num_matches * (to_str_len - from_str_len). Removes the arbitrary cap; the internal replace_with_max_len variant and its bounds checks are unchanged. Gandiva variable-length output uses int32 offsets, so a single output string cannot exceed INT_MAX (2 GB). Guard that boundary explicitly with a clear error message instead of letting the int32 size cast wrap silently (which could otherwise lead to under-allocation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| from_str_len, to_str, to_str_len, 65535, | ||
| out_len); | ||
| from_str_len, to_str, to_str_len, | ||
| static_cast<gdv_int32>(max_length), out_len); |
There was a problem hiding this comment.
Do we even need this argument in the function signature?
There was a problem hiding this comment.
Yes, thats how the function know how much memory to allocate for the output string.
| EXPECT_EQ(std::string(out_str, out_len), "TestString"); | ||
| EXPECT_FALSE(ctx.has_error()); | ||
|
|
||
| // Large output (>64 KB) must not overflow: buffer is sized to the exact result. |
There was a problem hiding this comment.
I'd feel more comfortable with few specific edge cases but I would not block the PR. Examples:
- INT_MAX resulting size
- 0 resulting size
There was a problem hiding this comment.
Tested one byte past the limit (INT_MAX + 1) rather than exactly INT_MAX, because a result of exactly INT_MAX is allowed and would force a real ~2 GB arena allocation — not something to do in a unit test. The boundary test confirms the guard triggers at precisely the right point.
| gdv_int32 to_str_len, gdv_int32* out_len) { | ||
| // Count non-overlapping matches to size the output buffer exactly, so large | ||
| // results are not capped by an arbitrary limit. | ||
| gdv_int64 num_matches = 0; |
There was a problem hiding this comment.
We just reviewed base64 encoding few weeks back and didn't accept double pass variant. See if you can reduce the logic back to single pass or provide benchmarks for single-vs-double variants (0/few/lots replacements) as perhaps resizing is more costly.
There was a problem hiding this comment.
old vs new is measured without a separate build: the old wrapper was literally replace_with_max_len_utf8_utf8_utf8(…, 65535, …), and that inner function is unchanged. So new = replace_utf8_utf8_utf8 (counting scan + exact alloc) and old = replace_with_max_len_utf8_utf8_utf8 given an adequately-sized buffer (the pre-change algorithm, no scan). The delta is purely the O(n) scan.
Results
| case | text_len | matches | new ns/call | old ns/call | delta |
|---|---|---|---|---|---|
| small/dense | expand a->ab | 256 | 256 | 3013.5 | 2250.3 |
| small/sparse | expand a->ab | 256 | 4 | 1738.6 | 922.1 |
| medium/dense | expand a->ab | 65536 | 65536 | 753132.4 | 573520.9 |
| medium/sparse | expand a->ab | 65536 | 1024 | 436435.6 | 224018.2 |
| large/dense | expand a->ab | 4194304 | 4194304 | 52924267.0 | 40737654.2 |
| large/sparse expand | a->ab | 4194304 | 65536 | 28408644.3 | 14630748.5 |
| large/dense | shrink ab->a | 4194304 | 0 | 12771361.1 | 12659342.5 |
Interpretation:
The scan's relative cost is consistent across sizes (~30% dense, ~90% sparse), as expected for an O(n) pass — it scales with input length, not match count.
Sparse is the worst case (+90%): with few matches, the old replace pass is mostly cheap byte-copying, so adding a full second comparison pass nearly doubles the work. Dense (+30%): the replace pass already does expansion-copying, so the scan is a smaller fraction.
Shrink path: +0.9% — confirms the refinement works; when to ≤ from the scan is skipped and there's no measurable overhead.
Rationale for this change
Gandiva's REPLACE hardcodes a 65535-byte output buffer, throwing Buffer overflow for output string whenever the result exceeds 64 KB. The cap is arbitrary: Gandiva's variable-length output column already grows dynamically and is only bounded by the int32 offset width (~2 GB). Real queries that replace into large concatenated/aggregated strings fail unnecessarily.
What changes are included in this PR
replace_utf8_utf8_utf8 now sizes the output buffer to the exact result instead of using a fixed cap. The output length of a replace is deterministic:
out_len = text_len + num_matches * (to_str_len - from_str_len)
The wrapper does a single counting pass over the input to find the number of non-overlapping matches of from_str (mirroring the match loop already used in the implementation), computes the exact size in gdv_int64 to avoid intermediate overflow, and passes that as max_length.
The internal replace_with_max_len_utf8_utf8_utf8 is unchanged — its bounds checks now act purely as a correctness backstop (they should never fire with an exact bound), and its explicit-max-length signature remains for the existing unit tests.
When to is shorter than from, the result shrinks and max_length <= text_len, so the shrinking path is sized correctly too.
Are these changes tested?
Yes. Added regression cases to TestStringOps.TestReplace in string_ops_test.cc:
A 35000-char 'X' input with X → XY, producing a 70000-byte result (previously overflowed at 65535) — asserts no error and exact length/content.
A 70000-char shrinking case (XX → X) to cover the shrink path on a >64 KB input.
Full precompiled suite passes locally (132/132), including the existing explicit-max_len overflow tests, which call the internal function directly and are unaffected.
Are there any user-facing changes?
REPLACE now succeeds on results larger than 64 KB instead of erroring. No API or signature changes.